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Nonparametric Methods for Doubly Truncated Data 

Bradley Efron and Vahe Petrosian 

Abstract 

Truncated data plays an important role in the statistical analysis of astronomical 
observations as well as in survival analysis. The motivating example for this paper con¬ 
cerns a set of measurements on quasars in which there is double truncation. That is, 
the quasars are only observed if their luminosity occurs within a certain finite interval, 
bounded at both ends, with the interval varying for different observations. Nonpara¬ 
metric methods for the testing and estimation of doubly truncated data are developed. 
These methods extend some known techniques for data that is only truncated on one 
side, in particular Lynden-Bell’s estimator and the truncated version of Kendall’s tau 
statistic. However the kind of hazard function arguments that underlie the one-sided 
methods fail for two-sided truncation. Bootstrap and Markov Chain Monte Carlo tech¬ 
niques are used here in their place. Finally, we apply these techniques to the quasar 
data, answering a question about their long-term luminosity evolution. 


Key Words: tau test, Lynden-Bell estimator, bootstrap, hypothesis test, Markov chain Monte 
Carlo, quasars, luminosity evolution, self-consistency. 
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1 Introduction 


Figure 1 shows an example of doubly truncated astronomical data. The plotted points 
are redshifts Zi and log luminosities y* for n = 210 quasars, as explained in Section 6. Due 
to experimental constraints the distribution of each yi is truncated to a known interval Ri 
depending on Zi. Truncation means that we would not have learned of yds existence if it fell 
outside of region i?*. Many experimental situations lead to truncated data, see for example 
McLauren et ah (1991). The double truncation seen in Figure 1, where y* goes undetected 
if it is either too small or too large, is less common than one-sided truncation. The quasar 
data has still another kind of truncation, of the redshift values Zi, but that will not affect 
our discussion. 

We will describe nonparametric methods that answer to two related questions concerning 
truncated data: 

Question 1 How can we test whether or not the yds are independent of the Zisl 

Question 2 Assuming independence, how can we estimate the marginal distribution of 
the yds? 

The answers apply to all forms of truncation, including the double truncation of Figure 1. In 
fact some of our methods apply to all forms of data censoring and truncation, as mentioned 
at the end of Section 5. 

Question 1 turns out to be crucial for the scientihc question posed by the quasars. A 
positive dependence of luminosity on redshift would mean that earlier quasars were intrinsi¬ 
cally brighter, or in other words that quasars have been evolving toward a dimmer state as 
the universe ages. Figure 1 certainly seems to show a strong positive dependence between 
redshift and luminosity, but that appearance is forced on us by the truncation limits, which 
increase sharply from left to right. The tests described in Sections 2, 3, 5, and 6 will show 
that a statistically signihcant positive relationship remains after accounting for truncation. 
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type= 1 , k= 0 



Figure 1: Doubly truncated data; points represent redshifts and luminosities for 210 quasars, 
as described in Section 6; luminosity subject to lower and upper truncation as indicated by 
dashes (lower truncation limits shifted down .25 for clarity.) Is luminosity correlated with 
redshift? 
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though of much smaller magnitude than the hgure suggests. 

Turnbull’s important 1976 paper answers Question 2 for arbitrary patterns of data trun¬ 
cation and censorship. Turnbull uses a self-consistency algorithm to calculate the nonpara- 
metric maximum likelihood estimate (MLE) of the y distribution. His equations take on 
interesting forms for doubly truncated data, as discussed in Section 4. 

Both Question 1 and Question 2 have simple closed-form solutions in the one-sided case, 
where the truncation regions Ri extend upwards to infinity. Tsai (1990) and also Efron 
and Petrosian (1992, 1994) answer Question 1 using a version of Kendall’s tau statistic 
appropriate to one-sided truncation. Lynden-Bell’s (1971) answer to Question 2 in the one¬ 
sided case is closely related to the Kaplan-Meier estimate for censored data, see Efron and 
Petrosian (1994). 

Sections 2-5 extend these ideas to two-sided truncation (the extensions applying with little 
change to any pattern of multiple truncation.) The two-sided case is less tractable for both 
questions. The testing problem of Question 1 is particularly challenging computationally, 
and raises issues concerning the relationship of bootstrap methods to Markov Chain Monte 
Carlo (MCMC). 

Lynden-Bell’s method is completely nonparametric, as is the tau test, but it can be quite 
inefficient in some circumstances. Section 4 also discusses more efficient parametric estimates 
based on special exponential families, as in Efron (1996) and Efron and Tibshirani (1996). 

We return to the quasar data in Section 6. The testing and estimation methods developed 
in Sections 2-5 are used there to answer Questions 1 and 2. 


2 Permutation Tests for Independence 


Doubly truncated data of the kind shown in hgure 1 can be described as follows: the 
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observed data consists of n pairs {zi,yi), i/i a real-valued response and Zi in covariate, with 
observation i/i restricted to a known region Ri = [ui,Vi], 

dsit& = {{zi,yi) with yi e Ri = [ui,Vi] for i = 1, 2,..., n}. (2.1) 

The n quadruplets {zi,yi,Ui,Vi) are observed independently of one another. The regions Ri 
can depend on Zi, as in hgure 1, and the Zi values themselves can be subject to observational 
truncation. 

Question 1 concerns testing the independence hypothesis He,: that if we could observe 
the yds without truncation they would be independent and identically distributed (i.i.d.) 
according to some common density function f{y). Because of truncation, Ho says that yi 
has the conditional density of f\y) restricted to i?*. 


Ho: f\yi\Ri) = f{yi)/Fi for yi e Ri 

= 0 for yi ^ Ri (2.2) 

independently for i = 1, 2,..., n. Here 

F^= r f{y)dy. (2.3) 

Jui 

Question 2 asks us to estimate f{y) assuming that Ho is true. 

This section discusses permutation tests of the independence hypothesis Ho. Table 1 and 
Figure 2 show an artiheial example of truncated data involving n = 7 points that will be 
helpful in carrying out the discussion. 

A permutation of y = {yi, y 2 , ■ ■ ■, yn), say y* = {yl, y^,..., y^) is observable if the per¬ 
muted values all fall within their truncation regions, that is if 

y* e Ri for i = 1,2,... ,n (2.4) 
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Vi Ri 


1 

0.75 

[0.4, 2.0] 

2 

1.25 

[0.8, 1.8] 

3 

1.50 

[0.0, 2.3] 

4 

1.05 

[0.3, 1.4] 

5 

2.40 

[1.1, 3.0] 

6 

2.50 

[2.3, 3.4] 

7 

2.25 

[1.3, 2.6] 


Table 2.3: Table 1; An artificial example of truncated data involving n = 7 data points. 


In figure 1, and Table 2, we see that {ys, y2, yi, y^, 1 / 5 ,1/6, Ur) is observable, but not ( 1 / 2 , yi, ys, Vi, 1 / 5 , Ve, Vy)- 
We define 


y = set of observable permutations. (2.5) 

It turns out that y has 78 members in the seven-point example. 

Permutation tests of the independence hypothesis Ho are based on the conditional distri¬ 
bution of y* in y given the ordered values y^i), y( 2 ), • • •, y{n) of the observed response vector 
y, (assuming for convenience no ties), say 


y() = (//(l),//(2), • • • ,//(n)) (2.6) 

and also given the Zi and Ri values, 

z = {zi,Z 2 ,... ,Zn) and R = {Ri, R 2 ,..., Rn)- (2.7) 

Independence Lemma Suppose that Ho is true. Then the conditional distribution of y* 
given y(), z, and R is uniform over y. 

Proof According to (2.2), (2.3), an observable vector y* = {yl,yy ... ,?/*) has Ho density 
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fig2. 7-point example, showing mie 
from qselfS 
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numbers at right are the self-consistent dens est 


Figure 2: The seven-point example from Table 1; dots indicate points {zi,yi)-^ vertical bars 
show truncation regions Ri = [ujjUi]; dashed horizontal lines indicate the ordered y values. 
Numbers at right give the nonparametric MLE for the y distribution assuming that the 
independence hypothesis Ho is true, as explained in Section 4. 
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If y* G 3^ then (2.8) equals OiLi/(l/(*))/IliLiThis has the same value for all y* G 3^, 
which proves the lemma. 

Permutation tests of Ho are carried out in the usual way: we choose a test statistic *S'(y), 
with larger values of S indicating stronger disagreement with Ho, and compare the observed 
value s = S{y) with the set of permutation values 

‘5 = {5(y*),y*G3^} (2.9) 

The p-value of the test is the proportion of S exceeding s. 

The tau test. A particular choice of the test statistic >S'(y) was advocated in Tsai 
(1990) and Efron and Petrosian (1992, 1994), in the context of one-sided truncation. A pair 
of indices {i,j) is said to be comparable if Ui G Rj and yj G i?*. Define 

C = set of comparable pairs (2-10) 

and 

Y. sign[(|/i - yj){zi - Zj)] / #C. (2.11) 

( ^\ 

For untruncated data = and r is Kendall’s tau statistic. If we decide that the 

vv 

independence hypothesis Hg is false then r provides a convenient nonparametric measure of 
correlation between y and ; 2 , see Section 3 of Efron and Petrosian (1994); r = .429 for the 
seven-point example. 


Here we will use just the numerator of (2.11), 



r = 


( 2 . 12 ) 


- yj){^i - ^j)] 

(hj)&C 

as the statistic 5'(y) for testing Ho, calling this the tau test. For the seven-point example 
f = 3 and the 78 members of y have this distribution of r* values, 

f*: <3 =3 >3 

#: 63 8 7 

(2.13) 

The one-sided p-value for testing Ho is (7-|-8/2)/78 (splitting the probability atom at f* = f.) 

Permutation tests based on r or r are identical for one-sided truncation, since is the 
same for all observable permutations, but they can differ under two-sided truncation. The 
two test statistics give almost the same results for the quasar data. Section 6 also discusses 
other choices of S'(y) that have greater testing power. 

3 Approximate P-Values 

The p-value for the seven-point example was found by complete enumeration of the tau 
statistic over the set of observable permutations y. This becomes impossible for data sets 
much larger than n = 7, and we need convenient approximations in order to carry out the 
test. 

It seems like a good approximation should be easy to hud. The tau statistic (2.12) has 
permutation expectation zero, 

Eperm{f*} = 0, (3.1) 
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since interchanging any comparable pair of indices reverses the sign of the corresponding 
component of f ((3.1) is not true for f, (2.11)). Moreover the components of r are short¬ 
tailed so that the normal approximation 


f ~ iV(0, (3.2) 

is likely to have good accuracy, see Theorem 2 of Tsai (1990). However, there seems to be no 
convenient formula for the permutation variance, at least not in the doubly truncated 

case. A formula does exist when the truncation is only one-sided, as discussed below. 

We can use Markov Chain Monte Carlo (MCMC) methods to approximate That 

is we can generate a random walk on 3^ that eventually produces uniformly distributed 
permutation vectors y*, and then estimate by the empirical variance of T(y*). A 

particularly simple MCMC scheme starts at y* = y and proceeds as follows: (1) choose a 
pair of indices i,j at random; and (2) interchange the and components of y* if the 
resulting vector is in y (that is if y* G Rj and y* G Ri) otherwise keeping y* the same. 

Elementary Markov Chain Theory says that y* has its stationary distribution uniform 
on 3^, see Gelman and Rubin (1992). This assumes that y is connected by coordinate 
interchanges, a fact proved recently by Persi Diaconis and Ronald Graham (personal com¬ 
munication). A sufficient number of iterations of steps (1) and (2) make y* nearly uniformly 
distributed on 3^. Repeating this whole process some large number B of times, perhaps 
B = 800, gives independent vectors y*(l), y*(2),..., y*(R), distributed nearly uniformly 
over 3^. Then we can estimate the permutation variance by 


^2 i:f=i[r(y*(&))-r(-)]^ 

‘^perm n _ i 


and approximate the p-value of the tan test by 


[^(■) = 


6=1 


(3.3) 


p = l-d>(f). 


f 


"r(y)/ CTperm 


(3.4) 
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where $ is the standard normal cumulative distribution function (CDF). A more direct 
approach, but one that tends to require larger values of Bi is to estimate p as at (2.9), by 

P = #{r*>f}/B. (3.5) 

An MCMC analysis of the quasar data of Figure 1 was carried out by Duncan Murdoch, 
personal communication. The number of iterations required to compute T(0) = r/dperm 
with the same acuracy achieved in Figure 5 was, very roughly, 4,000,000. In any given 
situation it is hard to know how many iterations of steps (1) and (2) are required to make 
y* sufficiently uniform on 3^. Generating independent replicates y*{b) as in (3.3), which 
is very convenient for error analyses, may be quite inefficient in the MCMC context. An 
information reference on these difficulties and their possible remedies is Gilks et ah (1993) 
and the ensuing discussion. 

Section 5 discusses a bootstrap approximation for dpej-m that sacrihces the theoretical 
exactness of the MCMC algorithm for more efficient and more dehnite numerical results. 
Bootstrap estimates are used in section 6’s analysis of the quasar data. The bootstrap 
approach has the advantage of applying to any kind of truncated and/or censored data, as 
mentioned at the end of Section 5. 

Neither MCMC nor the bootstrap are needed in the case of one-sided truneation where 
the observational regions in (2.11) are all of the form i?* = [Mi,cx)). In this case there is a 
simple way to generate vectors y* uniform on 3^. Dehne the risk-set numbers 

Nj = #{f : Ui < y(^j) and Pi > |/q)} (3.6) 

Nj is the size of the risk set in the following sense: Looking at hgure 2, but with all 
the regions Ri now extending up to inhnity, we begin with the smallest y value, and 
work upwards. At each there are Nj observable choices of ‘T/’ to go with |/q), all of 
which are equally likely under Ho- These are the values of 2 : “at risk” for pairing with pi^jy 
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For example, there are N 2 = 3 possible choices of to go with 1 /( 2 ), namely 2, 3, or 4, the 
actual choice in hgure 2 being z = 4. Altogether there are N = fliLi members of y, 
N = 324 (= 3 ■ 3 ■ 3 ■ 3 ■ 2 ■ 2 ■ 1) for the seven-point example. 

A uniform choice of y* in y is accomplished by choosing Zi uniformly from the A* possible 
choices at each this makes it easy to simulate the permutation distribution for any test 
statistic 5'(y). The permutation variance of the tan statistic (2.12) turns out to be 

^ — 1 

t^perm = 4^V) where Vi = ^ (3.7) 

i=i 

as shown for example in Section 3 of Efron and Petrosian (1994). We will use Formula (3.7) 
in Section 5 to validate the accuracy of the bootstrap estimate of 

4 Estimating The Response Distribution 


Question 2 of the introduction asks us to estimate the distribution of the response variable 
y assuming that the independence hypothesis Ho is true. More precisely, we want to estimate 
the response density f{y) in (2.2), (2.3). This section discusses nonparametric and parametric 
estimates of f{y) when the data is doubly truncated. 

The nonparametric MLE is a discrete distribution putting all of its probability on the 
observed responses yi, 1 / 2 ,..., yn, Turnbull (1976). Let f = (/i, / 2 ,..., /„) be a distribution 
putting probability /* on y^, and let F = (Fi, F 2 ,..., F„) be the vector of observational 
probabilities Fi = probf{|/ G Ri}, so 


F = Jf 


(4.1) 


where J is the n x n matrix describing which y values are included in the regions Ri, 


12 



1 if Uj e Ri 
0 if Vj ^ Ri 


(4.2) 


Jij — 


According to (2.2), the log likelihood of the observed sample is 


n 

f-iognc/iZ-R) 


i=l 

Differentiating (4.3) with respect to fi, and using (4.1), gives 


df~U Pi'F, 


The maximum likelihood equations di/dfi = 0 can be expressed as 


(4.3) 


(4.4) 


1 

f 



(4.5) 


where j = (^, , j-) and = (^, , -^). Notice that i in (4.3) stays the same 

if f, and hence F, is multiplied by any positive constant, allowing us to ignore the constraint 
Yh=i /i = 1 in the derivation of (4.5). 

Equation (4.5) is the same as Turnbull’s self-consistency criterion. We can solve for the 
MLE f by beginning with any initial estimate and then iterating between (4.1) and (4.5) 
(remembering to re-scale after each application of (4.5) so that the estimate of f sums to 1). 
The nonparametric MLE f for the seven-point example is shown at the right edge of Figure 
2. The substantial differences between f and the untruncated MLE (.14, .14, ..., .14) are not 
intuitively obvious from the truncation pattern. 


The method just described is an EM algorithm, and often converges quite slowly to 
the MLE. An alternative algorithm is based on Lynden-Bell’s 1971 method for computing 
the MLE in the case of one-sided truncation. For notational convenience suppose that the 
y values are indexed in increasing order, so yi = |/(j), where we continue to assume that 
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there are no ties. Corresponding to density function f = (/i, / 2 ,..., /„), the survival curve 
G = (Gi, G 2 ,..., Gn) and the hazard function h = (hi, /i 2 ,..., h„) are dehned by 

Gj = J2fi and hj = fj/Gj (4.6) 

i>j 

We can recover G and f from h via the relationships 

Gj = exp{^ log(l - hi)} and fj = Gj - G^+i (4.7) 

i<j 

Here Gi = G(?/(i)) = 1 by dehnition. 

The following theorem is verihed in the Appendix. 

Hazard Rate Theorem The nonparametric MLE f has hazard function h satisfying 

1 ” ^ 

= iv, + y JyO. (4.8) 

hj i=i 

where Nj are the risk-set numbers (3.5), {Jp} are the inclusion indicators (4.2), and 

4 = 4,+ / F, (4.9) 

Here F = Jf as in (4.1), and = E{/fc : Vk > Vi}. 

The numerator of Qi is the MLE probability of exceeding Vi, the upper observational 
limit for yi. In the one-sided truncation case Qi = since Vi = 00 , so equation (4.8) takes 
the form 


f = N,, ( 4 . 10 ) 

hj 

which is Lynden-Bell’s (1971) estimate. In this case (4.7) gives the MLE f directly, without 
iteration. In the case of two-sided truncation we can begin with (4.10) and iterate (4.7), 
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(4.8) to obtain the MLE. This converges quickly if the upper truncation is not severe, as 
turns out to be the situation for the quasar data. 

The solid curve in Figure 3 shows logG(|/), the nonparametric MLE of log(J2y.>yfi), 
for the 210 quasars. The log luminosity y is not the same quantity plotted in hgure 1, but 
is an adjusted version as explained in Section 6. Also shown is the Lynden-Bell estimate 
(4.10), dashed line, which ignores upper truncation. The two estimates are almost the same 
so in this case upper truncation has little effect. The MLE algorithm based on (4.7)-(4.8) 
converges very quickly here. 

It is not an accident that the Lynden-Bell estimate of survival is everywhere less than 
the MLE. 

Corollary The Lynden-Bell estimated survival curve, which ignores upper truncation, 
is less than or equal to the nonparametric MLE. 

The proof of the corollary is immediate from a comparison of (4.8) with (4.10), which 
shows that the estimated hazard rate can only be greater in the Lynden-Bell case. 

The curve labeled “SEE” in Figure 3 is derived from a special exponential family in the 
terminology of Efron and Tibshirani (1996). In this case the SEE density estimate f{y) is 
the MLE among densities of the form 

^og f {y) = Po + piy + p 2 y‘^ + , (4.11) 

for y in the range of the observed yi values. That is, it maximizes nr=i(/(l/*)/-^*) (2-2), 

(2.3) among all choices of (/3i, P2, PP) (with Pq then determined by the requirement that f{y) 
integrate to 1 over the interval [y^ipy^n)])- The appendix describes the calculation of f{y). 

Lynden-Bell’s estimate and the parametric MLE can behave erratically near the extremes 
of the y range, where the risk set numbers Nj may be small. At the arrowed point in hgure 
3 for example Ni = 2, giving Lynden-Bell estimates hi = .50 and G{y( 2 )) = -50 according 
to (4.10) and (4.7). The nonparametric MLE has G{y( 2 )) = -51. The SEE estimate smooths 
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figs 



adjusted log luminosity 


Figure 3: Estimated log survival curve logG(?/) as a function of the adjusted log luminosity 
evolution yj for the quasar data, adjusted for luminosity evolution: 6* = 2, as explained in 
Section 6; solid curve is nonparametric MLE; dashed curve is Lynden-Bell estimate (4.10), 
ignoring upper truncation; smooth dotted curve is cubic special exponential family. 
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out the bumps in the nonparametric MLE, here giving G{y{ 2 )) = -72. 

SEF estimates are less variable than the nonparametric MLE, but can be biased if based 
upon an incorrect model. The cubic model in Figure 3 was chosen by successive signihcance 
tests, as explained in the appendix. A small simulation study showed that percentile points 
of the cubic SEF estimates had roughly half the standard deviation of their nonparametric 
MLE counterparts. 


5 Bootstrap Tests for Independence 


We now return to the question of testing the independence hypothesis Ho, (2.2). this 
section discusses bootstrap approximations to the permutation p-values of Section 2. Most 
of the discussion is in terms of the tau test, but the method applies to any test statistic as 
mentioned at section’s end, and can be extended to arbitrarily complicated patterns of data 
censoring and truncation. 

Let f{y) be an estimate of the density for the response variable y calculated as in Section 
4, assuming that Ho is true. The estimate / might be the nonparametric MLE or an SEF 
estimate as in Figure 3. We can use / to draw a bootstrap sample y* by following the recipe 
in (2.2), (2.3): y* = {yl,yl, ■ ■ ■ ,yG) has independent components, with the component’s 
density being 

KvD/Fi fori/r 

0 for y* 

where Pi = /^, f{y)dy. 

An approximate version of the tau test can be carried out by generating B independent 
bootstrap samples y* and then proceeding as in (3.3), (3.4) to get the approximate p-value 
I? = 1 — <h(T). How big need B be? Letting 


G Ri 
^Ri 


independently i = 1, 2,..., n 


(5.1) 
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T = r(y)/aperm 


and T = f(y)/dperm, 


(5.2) 


standard normal-theory calculations show that 


sd,{f/T} = 1/Vw, (5.3) 

where sd^ indicates the simulation standard deviation. This gives 

B: 50 100 200 400 800 

sd^ 0.10 .07 .05 .035 .025 

(5.4) 

SO we need B = 800 bootstrap replications to determine T within about 2.5% of its ideal value 
T. These same calculations apply to the MCMC approach in (3.3), (3.4). B = 800 is large 
enough to permit a check in the normality assumption leading to the estimate p = 1 — <h(T), 
and a retreat to the nonparametric estimate (3.5) if normality looks dubious. 

As a check on the bootstrap test, dperm was estimated using B = 800 replications for the 
data in Figure 1, but ignoring the upper bounds of the truncation regions (i.e. taking all 
Vi = cx)). The estimate was dperm = .0441 ± .0011, “±” indicating the bootstrap simulation 
error, agreeing nicely with the exact permutation standard deviation (Jperm = .0445 obtained 
from (3.7). 

The bootstrap approximation for (Jperm looks much different than the MCMC approach 
of Section 3. In particular it requires a preliminary estimate of f{y)- The brief discussion 
in the appendix shows that the bootstrap and permutation calculations are more alike than 
they appear, and that similar approximations are used in more familiar statistical contexts. 

There is nothing special about the tan statistic as far as bootstrap or permutation meth¬ 
ods are concerned. Section 6 mentions a more powerful version of tan that puts greater 
weight on those terms in (2.11) having bigger values of \zi — Zj\. We could in fact employ 
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any test statistic >S'(y), and then use the bootstrap or MCMC to approximate the compar¬ 
ison set S and p-value in (2.9). The main advantage of tan-like test statistics is that they 
have bootstrap or permutation expectations zero under Ho, and nearly normal distributions, 
which eases the task of hnding p-values by simulation. 

Tsai (1990, Section 4) shows that the tan test of independence can be applied to data 
that is truncated below and censored above. In principle, bootstrap tests can be applied 
to any form of truncated and/or censored data. The hrst step is to estimate the response 
density f{y) using Turnbull’s (1976) nonparametric self-consistency algorithm. Then boot¬ 
strap samples y* are drawn as in (5.1), taking into account each y/s pattern of truncation 
and censoring, and the p-value approximated by comparing the test statistic 5'(y) with 
S = {S{y*{l)), S{y*{2)),..., S{y*{B))}. Romano (1988) gives a general discussion, and 
validation, of this kind of “null hypothesis bootstrap” test procedure. 


6 The Quasar Data 


Our estimation and testing theory will now be applied to the quasar data of Figure 
1. First the situation will be described more carefully. The original dataset consisted of 
independently collected quadruplets 

{zi,mi,ai,bi) i = l,2,...,n (6.1) 

where Zi is the redshift of the quasar and m* is its apparent magnitude. The numbers 
tti and bi are lower and upper truncation limits on mj. Quasars with apparent magnitude 
above bi were too dim to yield dependable redshifts (remembering that bigger values of m 
correspond to dimmer objects.) The lower limit ai was used to avoid confusion with non¬ 
quasar steller objects. In this study ai = 16.08 for all i, while bi varied between 18.494 and 


19 



18.934. The full dataset comprised n = 1052 quasars. Here we are considering a randomly 
selected subset of n = 210 quadruplets (6.1). 

Farther quasars appear dimmer of course, that is they tend to have bigger values of m*. 
Hubble’s law, which says that distance is proportional to redshift, allows us to transform 
apparent magnitudes into a luminosity measurement that should be independent of distance. 
This transformation depends on the cosmological model assumed. The log luminosity values 
Hi in Figure 1 were obtained from a formula yi = t{zi, rrii) that takes into account relativistic 
effects of the distance. 


y, = 19.894 - 2.303|| + 2 log(Z, - ^ log(Z,), (6.2) 

where Z, = 1 + Zi. Formula (6.2) is derived from the Einstein-deSitter cosmological model, 
Weinberg (1972). The last term makes the so-called K-correction, taking into account the 
shifting of the spectrum due to redshift. 

Larger values of yi correspond to intrinsically brighter quasars in Figure 1. The truncation 
limits Ri = [ui, n*] were obtained by applying transformation (6.2) to the observational limits 
for TTlj, 


Ui = t{zi,bi) and Vi = t{zi,ai) (6.3) 

This makes Ui and n*, the lower and upper dashes in hgure 1, strongly increasing functions 
of Zi, even though a* and 6* are not. 

One of the principal goals of the quasar investigations is to study luminosity evolution: 
quasars may have been intrinsically brighter in the early universe and evolved toward a 
dimmer state as time went on. This would tend to make the points on the right side of 
Figure 1 higher since larger redshifts correspond to quasars seen longer ago. However in the 
absence of luminosity evolution we should have yi independent of Zi except for truncation 
effects. This brings us back to Question 1 of the introduction. Testing the independence 
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hypothesis Hg amounts to testing for the absence of luminosity evolution. 

A convenient one-parameter model for luminosity evolution says that the expected log 
luminosity increases linearly as 6 ■ log(l -|- z), with 6 = 0 corresponding to no evolution. If 6 
is a hypothesized value of the evolution parameter then instead of yi being independent of 
Zi we should test the null hypothesis , 


He '■ = yi~ d ■ log(l -|- Zi) independent of Zi (6.4) 

Figure 4 shows plots of the data for 9 equal 0, 2, and 4, all for our same set of 210 quasars. 
Notice that the truncation regions Ri = {ui,Vi) also change with 9, 

Ui{9) = Ui — 9 ■ log(l -|- Zi) and Vi{9) = Vi — 9 ■ log(l -1- Zi) (6.5) 

We can apply the tau test to each of the null hypotheses Hg. This was done for values 
of 9 between 0 and 4 with the results shown in hgure 5. The solid curve is the standardized 
test statistic T = f/dperm, (3-4), with dperm determined by bootstrap sampling. B = 800 
bootstrap replications (5.1) were drawn for 9 = 0, .5,1,... ,4. Almost exactly the same curve 
was obtained using f,, (2.11), in place of r. 

We see that T(0) = 2.20, giving an approximate one-sided p-value 1 — $ (2.13) = .015. 
The tau test rejects the null hypothesis of independence Hg in favor of a positive value of 
the luminosity parameter 9. hi 9 = 2.38 we have T{9) = 0. The T{9) curve crosses ±1.645 
at [1.00, 3.20], which provides an approximate 90% central conhdence interval for 9. As a 
point of comparison, using all 1052 quasars gave 9 = 2.11 and 90% interval [1.38, 2.63]. 

If we are willing to ignore the upper truncation limits (setting all the Vi = cx)) we can 
employ the more exact one-sided tau test (3.5)-(3.6). These results, which did not involve 
bootstrap sampling, were only slightly more signihcant than those for the two-sided test, as 
shown by the dots in Figure 5. 

The choice 9 = 2 makes the adjusted log luminosity yi{2) = yi—2 log(l± 2 :i) approximately 
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theta= 0 


theta= 2 


theta= 4 



Figure 4: Plots of the quasar data of Figure 1 for three choices of the luminosity evolution 
parameter 9] 9 = 0 corresponds to Figure 1; other values of 9 plot {zi,yi{9)), (6.4), with 
limits Ui{9) and Vi{9), (6.5). 
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fig5 



0 12 3 4 


Figure 5: Tau test of the null hypotheses Hg, (6.4), for the 210 quasars; 9 = 0, 0.5,1,..., 4; 
solid curve T = r/dperm, (3.4), crosses zero at 9 = 2.38; dperm found by bootstrap sampling 
(5.1); 90% central conhdence interval 9 G [1.00,3.20]. Dots indicate T values assuming only 
one-sided truncation. 
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independent of Zi so we can estimate its density fiy) as in Section 4. Figure 3 shows the 
estimated log survival curve logG. It curves sharply downward for large values of y, which 
is a fortunate thing for the testing results. If logG were linear then f{y) would correspond 
to a one-sided exponential density. Because of the exponential’s memoryless property it is 
impossible to test for independence in the exponential case (unless the lower endpoint of the 
exponential density exceeds some of the lower truncation limits Ui). 

On the other hand, the MLE f{y) consistently estimates an exponential tail even if 
yi and Zi are not independent. It is a good idea to estimate fiy) or G{y) even if Question 
1 is of primary interest. Should the MLE turn out to be exponential then any testing 
procedure will be futile. 

Model (6.4) gives us reason to question the efficacy of the tan statistic in this situation. 
Let do be the true value of the evolution parameter. Then the difference of two of the 
luminosity measurements in Figure 1, where 0 = 0, can be expressed as 

Vi-Vj = yi{do) - yj{9o) + 9o{wi-Wj) [w* = log(l 4-^i)] (6.6) 

The differences yi{9o)—yj{9o) are symmetrically distributed about zero, except for truncation 
effects so (6.6) suggests that (i, j) pairs with bigger values of \wi—Wj\ will be more informative 
in testing for deviations from Hq. An analysis of the 4907 comparable pairs for the data in 
Figure 1 verihed that those with large values of \wi — Wj\ were contributing more consistently 
to the tan statistics: 


\Wi-Wj\ : 


0 .25 .5 .75 1 

pioh{sign{yi — yj){zi — Zj) = 1} .51 .56 .59 .62 .63 


A modihed version of the tan statistic (2.11) was tried on the quasar data. 


H I log(wi/Wj) I sign [{yi - yj){zi - Zj)] / I logKM)l> 
c c 


(6.7) 
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but it showed only modest improvements over either f or f. The gains were more substantial 
when all 1052 quasars were included, giving standardized statistic T' = 3.21 compared to 
f = 2.77 for testing Ho. 
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Appendix 


A Proof of the hazard rate theorem (4.8) 

Because the nonparametric MLE puts all of its probability on the observed responses, and 
because it is invariant nnder monotonic transformation of the y scale, we can assnme that 
yi = i for i = 1,2,..., and that Ri = [ui, Vj] where Ui and u* G {1, 2,..., n} (again assnming 
no tied y valnes.) In addition to Nj = : Ui < j and i > j}, (3.5), we dehne 

Mj = : Ui = j} (A.l) 

It is easy to see that 

Ml = Ni and Mj = Nj — Nj_i + 1 for j = 2, 3,..., n. (A.2) 

We will also assume that 


Mn = 0, (A.3) 

because if this were not trne then the largest observation yi = n wonld have been trnncated 
to the degenerate interval = [n,n], and we conld rednce the sample size to n — 1. (A3) 
implies Nn-i = + 1 = 2 according to (A2), since = 1. 

Following the notation of section 4 , a density f = (/i, /2, • • •/n) has likelihood L = 
riiLi fi/Fi foi’ the observed sample. The hazard fnnction h = (hi, ^2, ..., h„) has h„ = 1, 
so we need only consider its first n — 1 coordinates. The following lemma expresses the 
likelihood in terms of the hazard rate. 

Lemma The likelihood corresponding to (/i, / 2 ,..., /„) or (hi, h 2 ,..., h„_i) is 
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n £ n—1 n vi 

i=n 4=in / [iiti - n(i - (a.4) 

i=l j=l i=l Ui 

Proof In the one-sided case of no upper truncation, where all the Vi = n, the denominator 
in (A4) equals 1 (since = 1), so that (A4) becomes 


n 


n 

i=l 

This is easy to verify, starting from 


A 

F^ 

hj 


n—1 

n A,(l - hi)’''’-' 


i=i 


fj/Gf 


(A.5) 


n—1 

u hj{l - 


t=i 


/. A, 


n 


i+i \Ni-i 


LI ^ \ ^ 

j=l 

n n—1 


n/./ncf. 

i=i 


n-1 QNn-l-1 

t A i t /^Nl ^7V2 + 1 ——2 + 1^"^*^^ 

j=l ? ^2 * * * ^n—1 


(A,7) 


where we have used Gn = fn, -^n-i = 2, (A2), and (A3). But in the one-sided case 

np'GA = nr=i Fi, since each Fi must equal some Gj, so (A6) proves (AS). 

In the two-sided case 


nr n £ n ^ 

n# = [n 7 ^] tn 


(A.8) 


J.J. F. LJ.J. ^ J LJ.J. p J- 

i=l ^ * i=l i=l 4 I 

But the first bracketed factor on the right is the same as nr=i(/*/-^*) for the one-sided case, 
and formula (4.7) says that 


a, 




Vi 




-1 


Fi Gui — Gvi- 

which together give the lemma (A4). Finally, differentiating the log likelihood 


(A.9) 


n—1 


fog ^ = II [ fog hj + {Nj - 1) log (1 - hj)] - fog [1 - 11(1 - hk)]. (A.IO) 

j=l i=l 


Ui 
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gives 


d\ogL 
dhi 


hi 


1 — hi 


iw-i)+1: 


G 


Vi-i 


V "'1 ^ 'V i:jeRi 

which is equivalent to the hazard rate theorem (4.8), (4.9). 


(A.ll) 


The likelihood equality (A.5) for the one-sided case can be obtained directly from fa¬ 
miliar survival analysis arguments involving successive conditionally independent binomial 
likelihoods. See section 2 of Efron (1988). In the two-sided case successive conditioning gives 
the messier expression 


r=in rid-Miin 

j=ii£Afj j=i ^ 'bi 

when hij is the hazard function for Ui, 

= E A. (V13) 

j<k<Vi 

and Afj = {i : Ui < j and i > j}, the risk set”. The lemma says that this reduces to 
(A4). 

Special Exponential Families Classic exponential families such as the normal, Pois¬ 
son, and binomial have mathematical properties that greatly simplify their use. Modern 
computational equipment allows us to design special exponential families (SEE) for partic¬ 
ular applications without worrying about mathematical tractability. The SEE appearing in 
Figure 3 has a density of the form 

f^{y) = e’?'h 3 /)-<AC) for yey (A. 14) 

where t{y) = rj = ( 771 , 772 , 773 ), TZ = [ min (i/j), max( 7 /i) ] and 0 ( 77 ) is chosen to 

make fy f^{y)dy = 1 



The cubic SEF shown in Figure 3 used the MLE fj of the parameter vector 77 . In order 
to calculate f/ we dehne 


r,(9) = / (A.15) 

Jn 

where g{y) may be a vector or matrix function of y in which case the integrals in (A. 13) are 
carried out component-wise. Then the score function ^n{y) = ^ log/??(y) for fhe truncated 
data structure (2.2), (2.3) is 


4(y) = yi t(vi) - T,{i, ■ t) / r,(/.)], (A. 16 ) 

i=l 

where Ii{y) is the indicator function for i?*, as in Efron and Tibshirani (1996). The second 
derivative matrix is given by 


- 4(y) = ■ i") / r,(/.) - I T,(I, . t) / T^(h)) in, (A,17) 

i=l 

with indicating the n x n outer product matrix (f* tj). 

The MLE fj, satisfying ii^{y) = 0, is found by Newton-Raphson iteration 

y{k + 1) - y{k) = [ -%fc)(y) ]“^ %fc)(y) , (A. 18) 

These calculations go quickly because they involve only one-dimensional integrals (A. 13), 
though many of them. Successive models t{y) = yj t{y) = {y,y^), ■ ■ ■ were tried on the data 
for Figure 2. Standard hypothesis tests led to the choice of a cubic model. These tests used 
the estimated covariance matrix [ —^'^^(y) from (A.15) to assess the signihcance of the fj 

coefficients, e.g. whether or not fj^ is signihcantly non-zero for the cubic model. 

Note The nonparametric MLE is itself an SEF estimate, where the components of t{y) in 
(A.12) are delta functions on the observed yi values, t{y) = {S{y—yi), S{y—y 2 ), ■ ■ ■, S{y—yn))- 
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Bootstrap and Permutation Calculations Section 5 discusses bootstrap approxima¬ 
tions to permutation tests for the null hypothesis of independence Ho- The situation there, 
involving doubly truncated data, looks complicated but in fact it is perfectly analogous to 
much simpler and more familiar statistical problems. 

Consider the problem of testing the equality of two binomial probabilities. We observe 
independent Bernoulli variates 

■ - Vim ~ (A.19) 

1/21,1/22, •• •l/2n2 ~ Bi{l,Tl) (A.20) 

and wish to test the null hypothesis Ho that tti = 712 = tt (say). Letting xi = yu and X 2 = 
YJi=i y 2 i we can arrange the data in a 2 x 2 table, first row {xi,ni — 1 / 2 ) and second row 
{x 2 ,n 2 — X 2 ), in which case Ho is the usual hypothesis of independence for the table. 

If Ho is true than x+ = X 1 +X 2 is a sufficient statistic and tt) with n+ = ni-|- 

n 2 . Moreover all n+! permutations of the combined data vector y = ( 1 / 11 , 1 / 12 , • • • ,l/ 2 n 2 ) are 
equally likely given x+, allowing us to construct permutation tests for Ho without worrying 
about the nuisance parameter tt. The permutation test based on the statistic 

S = pi-P2 [Pi = xi/ni P2 = X2/n2] (A.21) 

is Fisher’s exact test for independence. S has permutation expectation zero and variance 

t^perm = . ^(1 “ ^)(— + “), [ ^ = X+/n+] (A.22) 

^ n+ — 1 Til n 2 

Suppose we did not know formula (A20) and wished to approximate by bootstrap 
simulation. The null hypothesis bootstrap replaces (A18) with 

1/n, 1/1*2, •••,l/ 2 n 2 ~ Bt{l,n), (A.23) 
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giving 


ni n2 

S’=pI-p'2 = Y1 S/iVm - E ViJn-P (A.24) 

i=l i=l 

S* has bootstrap expectation zero and variance 

S^perm = ^(1 “ ^) (“ + “)• (A.25) 

^ ni n2 

The usual statistic for testing independence is exactly 

The bootstrap sampling is unconditional in the sense that the samples do not necessarily 
have x\ = x+. Nevertheless we get an approximation that is accurate to second order, 

?perm/t^perm = 1 + C>(l/n+). (A.26) 

It can be shown that this is no accident, and that second-order accuracy follows for any 
situation of the following sort: the observed data can be written as {A,B) where under 
Ho, A has an exponential family of densities and the conditional density f{B\A) does not 
depend on the parameters of the exponential family. 

In the binomial situation A = Xj^ and B is a permutation of a vector of x^ I’s and n+ —x^ 
O’s. Under Ho, A ~ Bi{n^,Ti) and B\A is uniform. It is easiest to see the connection with 
our truncated data problem through the SEF formulation. Formula (A. 13) leads to an 
exponential family of densities for y under the truncated data model ( 2 . 2 ), with sufficient 
statistic A = As the exponential family (A.13) grows larger, say by including 

more polynomial terms in tiy) = [y, y'^, , ...), A itself gets bigger, but we can always 

take A = y( ), the order statistic of y. But A = y( ) is equivalent to taking A = y (since 
the regions Ri are known ancillaries), as we did for the permutation tests of section 2. 
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